arXiv:cs/0612058v1 [cs.DS] 10 Dec 2006

Adaptive Simulated Annealing: A Near-optimal Connection between Sampling and Counting arXiv:cs/0612058v1 [cs.DS] 10 Dec 2006 Daniel Stefankovi c Santosh Vempala December 9, 2006 Eric Vigoda Abstract We present a near-optimal reduction from approximately counting the cardinality of a discrete set to approximately sampling elements of the set. An important application of our work is to approximating the partition function Z of a discrete system, such as the Ising model, matchings or colorings of a graph. The typical approach to estimating the partition function Z( ) at some desired inverse temperature is to define a sequence, which we call a cooling schedule, 0 = 0 < 1 < · · · < = where Z(0) is trivial to compute and the ratios Z(i+1 )/Z(i ) are easy to estimate by sampling from the distribution corresponding to Z(i ). Previous approaches required a cooling schedule of length O (ln A) where A = Z(0), thereby ensuring that each ratio Z(i+1 )/Z(i ) is bounded. We present a cooling schedule of length = O ( ln A). For well-studied problems such as estimating the partition function of the Ising model, or approximating number of colorings or matchings of a graph, our cooling the schedule is of length O ( n), which implies an overall savings of O (n) in the running time of the approximate counting algorithm (since roughly samples are needed to estimate each ratio). A similar improvement in the length of the cooling schedule was recently obtained by Lov´sz and Vempala in the context of estimating the volume of convex bodies. a While our reduction is inspired by theirs, the discrete analogue of their result turns out to be significantly more difficult. Whereas a fixed schedule suffices in their setting, we prove that in the discrete setting we need an adaptive schedule, i. e., the schedule depends on Z. More precisely, we prove any non-adaptive cooling schedule has length at least O (ln A), and we present an algorithm to find an adaptive schedule of length O ( ln A). 1 Introduction This paper explores the intimate connection between counting and sampling problems. By counting problems, we refer to estimating the cardinality of a large set (or its weighted analogue), or in a continuous setting, an integral over a high-dimensional domain. The
Department of Computer Science, University of Rochester, Rochester, NY 14627, and Comenius University, Bratislava. Email: stefanko@cs.rochester.edu College of Computing, Georgia Institute of Technology, Atlanta GA 30332. Email: vempala@cc.gatech.edu. Supported by the NSF and a Guggenheim fellowship. College of Computing, Georgia Institute of Technology, Atlanta GA 30332. Email: vigoda@cc.gatech.edu. Research supported in part by NSF grant CCF-0455666.
1 sampling problem refers to generating samples from a probability distribution over a large set. The well-known connection between counting and sampling is the starting point for popular Markov chain Monte Carlo methods for many counting problems. Some notable examples from computer science are the problems of estimating the volume of a convex body [3, 11] and approximating the permanent of a non-negative matrix [9]. In statistical physics, a key computational task is estimating a partition function, which is an example of a counting problem. Evaluations of the partition function yield estimates of thermodynamic quantities of interest, such as the free energy and the specific heat. The corresponding sampling problem is to generate samples from the so-called Gibbs (or Boltzman) distribution. We present an improved reduction from approximate counting to approximate sampling. These results improve the running time for many counting problems where efficient sampling schemes exist. We present our work in the general framework of partition functions from statistical physics. This framework captures many well-studied models from statistical physics, such as the Ising and Potts models, and also captures many natural combinatorial problems, such as colorings, independent sets, and matchings., For the purpose of this paper we define a (discrete) partition function as follows. Definition 1.1. Let n 0 be an integer. Let a0 , . . . , an be non-negative real numbers such that a0 1. The function
n Z() =
i=0 ai e-i is called a partition function of degree n. Let A := Z(0). This captures the standard notion of partition functions from statistical physics in the following manner. The quantity i corresponds to the possible values of the Hamiltonian. Then ai is the number of configurations whose Hamiltonian equals i. For instance, in the (ferromagnetic) Ising model on a graph G = (V, E), a configuration is an assignment of +1 and -1 spins to the vertices. The Hamiltonian of a configuration is the number of edges whose endpoints have different spins. The quantity is referred to as the inverse temperature. The computational goal is to compute Z() for some choice of 0. Note, when = 0 the partition function is trivial since Z(0) = n ai = 2|V | . The condition i=0 a0 1 is clearly satisified, in fact, we have a0 = 2 by considering the all +1 and the all -1 configurations. The general notion of partition function also captures standard combinatorial counting problems as illustrated by the following example. Let be the set of all k-labelings of a graph G = (V, E) (i. e., labelings of the vertices of G by numbers {1, . . . , k}). Given a labeling , let its Hamiltonian H() be the number of edges in E that are monochromatic in . Let i denote the set of all k-labelings of G with H() = i. Let ai = |i |. We would like to compute Z() = a0 , i. e., the number of valid k-colorings of G. Once again, the case = 0 is trivial since we have Z(0) = k|V | . The condition a0 1 simply requires that there is at least one proper k-coloring. The standard approach to compute Z() is to express it as a telescoping product of ratios of the partition function evaluated at a sequence of 's, where the initial = 0 is the trivial case. The ratios are approximated using a sampling algorithm. More precisely, consider a set of configurations which can be partitioned as = 0 1 · · · n , 2 where |i | = ai for 0 i n. Suppose that we have an algorithm which for any inverse temperature 0 generates a random configuration from the distribution µ over where the probability of a configuration is µ () = e-H() , Z() (1) where H() is the Hamiltonian of the configuration defined as H() = i such that i . We now describe the details of the standard approach for using such a sampling algorithm to approximately evaluate Z(). In the general setting of Definition 1.1, for X µ , the random variable (2) W, := e(- )H(X) is an unbiased estimator for Z( )/Z(). Indeed, E W, = 1 Z() e-H() · e(- )H() =
Z( ) . Z() (3) Thus, a0 = Z() can be approximated as follows. Take 0 < 1 < · · · < with 0 = 0 and = . Express Z() as a telescoping product Z() = Z(0) Z(1 ) Z(2 ) Z( ) ... , Z(0 ) Z(1 ) Z(-1 ) (4) and approximate each fraction in the product using the unbiased estimator Wi ,i+1 . The initial term Z(0) is typically trivial to compute. Taking sufficiently many samples for each Wi ,i+1 will give a good approximation of a0 . The question we study in this paper is: how should one choose the inverse temperatures 0 , . . . , so as to minimize the number of samples needed to estimate (4)? A specific choice of 0 , . . . , is called a cooling schedule. In the past, MCMC algorithms have used cooling schedules that ensure that each ratio in the telescoping product is bounded by a constant. Dyer and Frieze [2] used a non-trivial application of Chebyshev's inequality to show that O() samples per ratio are sufficient to obtain an (1 ± ) approximation of Z(). For applications such as colorings or Ising model, requiring that each ratio is at most a constant, implies that the length of the cooling schedule is at least (n), since Z(0) and Z() typically differ by an exponential factor. A general cooling schedule of length O (n) was presented in Bez´kova et al [1]. All schedules prior to our work use non-adaptive a cooling schedules. By non-adaptive we refer to a schedule that depends only on n and A but not the structure of Z. The recent volume algorithm of [11, 12] uses a non-adaptive cooling schedule of length O( n) to estimate the volume of a convex body in Rn . Their result relies on the logconcavity of the function n Z() where Z is the analogue of the partition function. Here we present a cooling schedule for discrete partition functions with length roughly ln A where A = Z(0). (Note, ln A is roughly n in the examples we have been 3 considering here). The discrete setting presents the following new challenge. As we show in this paper, there can be no short non-adaptive cooling schedule for discrete partition functions. Any such non-adaptive schedule has length (ln A) in the worst case. (We defer precise statements of our results until we formally present the background material.) Our main result is that every partition function does have an adaptive schedule of length roughly ln A. Further, the schedule can be figured out efficiently on the fly. The existence of a short schedule follows from an interesting geometric fact: any convex function f can be approximated by a piecewise linear function g consisting of few pieces, see Figure 1 in Section 4 for an illustration. More precisely, f is approximated in the following sense: for all x 0, we have 0 g(x) - f (x) 1. For well-known problems such as counting colorings or matchings, and estimating the partition function of the Ising model, our results imply an improvement in the running time by a factor of n, since the complexity grows with the square of the schedule length; see Section 8 for a precise statement of the applications of our results. We observe (in Section 4.1) that our techniques apply to the continuous setting as well, specifically, to the integration of general functions in Rn . The key property required for the existence of an adaptive schedule is the logconvexity of the partition function Z(). However, this does not immediately lead to any new algorithms for integration since logconcave functions are the most general class of continuous functions for which we have efficient sampling algorithms. In Section 2 we formalize the setup described in this introduction. The lower bound for non-adaptive schedules is formally stated as Lemma 3.3 in Section 3. The existence of a short cooling schedule is proved in Section 4, and formally stated in Theorem 4.1. The algorithm for constructing a short cooling schedule is presented in Section 5. Finally, in Section 8 we present applications of our improved cooling schedule. 2 Chebyshev cooling schedules Let W := W, be the estimator defined by (2) whose expectation is a individual ratio in the telescoping product. As usual, we will use the squared coefficient of variance Var (W )/E (W )2 as a measure of the quality of the estimator W , namely to derive a bound on the number of samples needed for reliable estimation of E (W ). We will also use the quantity E W 2 /E (W )2 = 1 + Var (W )/E W 2 . Lemma 2.1 (Chebyshev). Let W be a random variable with E (W ) < and E W 2 < . Let > 0. We have P (1 - )E (W ) W (1 + )E (W ) 1 - E W2 Var (W ) 1- . 2 E (W )2 2 E (W )2 The following lemma of Dyer and Frieze [2] is now well-known. Theorem 2.2. Let W1 , . . . , W be independent random variables with E Wi2 /E (Wi )2 B for i []. Let W = W1 . . . W . Let Si be the average of 16B/2 independent random samples from Wi for i []. Let S = S1 S2 · · · S . Then Pr (1 - )E W S (1 + )E W 4 3/4. It will be convenient to rewrite E W 2 /E (W )2 for W := W, in terms of the partition function Z. We have E W2 = and hence 1 Z() e-H() e2(- )H() = Z(2 - ) , Z() E W2 E (W )2 = Z(2 - )Z() . Z( )2 (5) Equation (5) motivates the following definition. Definition 2.3. Let B > 0 be a constant. Let Z be a partition function. Let 0 , . . . , be a sequence of inverse temperatures such that 0 = 0 < 1 < · · · < = . The sequence is called a B-Chebyshev cooling schedule for Z if Z(2i+1 - i )Z(i ) B, Z(i+1 )2 for all i = 0, . . . , - 1. The following bound on the number of samples is an immediate consequence of Theorem 2.2. Corollary 2.4. Let Z be a partition function. Suppose that we are given a B-Chebyshev cooling schedule 0 , . . . , for Z. Then, using 16B2 /2 samples in total, we can compute S such that P (1 - )Z() S (1 + )Z() 3/4. (6) 3 Lower bound for non-adaptive schedules A cooling schedule will be called non-adaptive if it depends only on n and A = Z(0) and assumes Z() 1. Thus, such a schedule does not depend on the structure of the partition function. The advantage of non-adaptive cooling schedules is that they do not need to be figured out on the fly. An example of a non-adaptive Chebyshev cooling schedule that works for any partition function of degree n, where Z(0) = A, is n ln A 1 2 , . 0, , , . . . , n n n (7) The idea behind the schedule (7) is that small changes in the inverse temperature result in small changes of the partition function. We will state this observation more precisely, since we will use it later. Lemma 3.1. Let > 0 and let + . Let Z be a partition function of degree n. Then Z()e-n Z( ) Z(). (8) 5 Proof : For i n we have e-i e-n e-(+)i e- i e-i . (9) Equation (8) now follows by applying (9) to each term of the Z's in (8). To see that (7) is a Chebyshev cooling schedule, note that, by Lemma 3.1, the random variable W, defined by (2) has values from the interval [1/e, 1] if 0 - 1/n. This implies that for W := W, the left-hand side of (5) is bounded by a constant if , < are neighbors in (7). It remains to show that (5) is bounded for = ln A and = . Note that that Z() 1 (since a0 1) and
n Z(ln A) = a0 +
i=1 ai e -i ln A 1 Z() + A n i=1 ai Z() + 1. and hence for the right-hand side of (5) we obtain Z(ln A) 2. Z() (10) The length of the schedule (7) is O(n ln A). The following more efficient non-adaptive Chebyshev cooling schedule of length O((ln A) ln n) is given in [1]: 1 2 k k k 2 k t 0, , , . . . , , , ,..., , , n n n n n n (11) where k = ln A, = 1 + ln1A , and t = (1 + ln A) ln n. The schedule (11) is based on the following observation (the statement of Lemma 3.2 slightly differs from [1] and hence we include a short proof). Lemma 3.2 ([1]). Let Z be a partition function with Z(0) = A. Let > 0 be an inverse temperature and let = (1 + ln1A ). Then 1 Z() Z( ). 2e Proof : Let n be the degree of Z. First assume that an e-n 1. We have an Z(0) = A and 1 hence ln A . This implies + n and we can use Lemma 3.1. n -n < 1. Let k {0, . . . , n} be the smallest such that Now assume an e
n ai e-i < 1.
i=k (12) Note that k 1, since a0 1. From the minimality of k we obtain
n Ae-(k-1) i=k-1 ai e-i 1, 6 and hence (k - 1) ln A. Hence for i k - 1 we have i i + 1. Now
k-1 Z() < 1 +
i=0 ai e-i , (13) and Z( ) k-1 i=0 k-1 ai e- i i=0 ai e-i-1 1 e k-1 i=0 1 ai e-i . e (14) Combining (13) and (14) we obtain the result. Next we show that the schedule (11) is the best possible up to a constant factor. We will see later that adaptive cooling schedules can be much shorter. Lemma 3.3. Let n Z+ , and A, B R+ . Let S = 0 , 1 , . . . , be a non-adaptive B-Chebyshev cooling schedule which works for all partition functions of degree at most n with Z(0) = A, and Z() 1. Assume 0 = 0 and = . Then ln(n/e) ln(A - 1) -1 . ln(4B) (15) In the proof of Lemma 3.3 we will need the following bound on the first step of the cooling schedule. Lemma 3.4. Assume A - 1 > 4B. Then 1 ln(4B) . n (16) Proof of Lemma 3.4: Let 0 a A - 1. Then S has to be a B-Chebyshev cooling schedule for Z() = A 1 + ae-n . 1+a The equation (6) needs to be satisfied for Z, 0 = 0 and 1 . Thus (1 + ae-21 n )(1 + a) B. (1 + ae-1 n )2 After substitution z = e-1 n , equation (17) becomes equivalent to (1 + az 2 )(1 + a) =1+a (1 + az)2 1-z 1 + az
2 (17) B. (18) 1 Suppose that z A-1 . Note that the left-hand side of (18) is decreasing in z. Hence, (18) 1 is true for z = A-1 . Let a = A - 1. For this choice of a and z, (18) yields (A - 1)/4 B, 1 a contradiction with A > 4B + 1. Thus, z > A-1 . Since z > 1/(A - 1), we have 1/z < A - 1 and, hence, we can choose a = 1/z. Plugging a = 1/z into (18) we obtain (1 + z)2 B, (19) 4z 7 and, hence, z 1/(4B), which implies (16). The Lemma 3.4 immediately gives a bound on the later steps in the schedule. If the current inverse temperature is i then the coefficient of degree k is decimated by e-i k . Corollary 3.5. Let k {1, . . . , n}. Assume (A - 1)e-i k > 4B. Then i+1 - i ln(4B) . k Proof of Lemma 3.3: Let S = 0 , 1 , . . . , be the shortest sequence such that 0 = 0, = and the Corollary 3.5 is satisfied for S . We can greedily construct the shortest sequence S as follows. If k {1, . . . , n} is the largest such that (A - 1)e-i k > 4B then we take i+1 = i + ln(4B) . k
ln(4B) . i (If (A - 1)e-i 4B then we take i+1 = .) Let xi be the number of indices for which i+1 - i =
n Let j {2, . . . , n} and (20) =
i=j xi ln(4B) . i From we take a step of length at least ln(4B) (since we already took all shorter steps) j-1 and hence (A - 1)e-j 4B. (21) Plugging (20) into (21) we obtain
n xi
i=j 1 A-1 ln(4B) ln . i j 4B (22) which implies (15). Summing (22) for j = 2, . . . , n we obtain n n 1 A - 1 A-1 n xi (ln(4B)) ln ln , ln j 4B e 4B
j=2 j=2 The number of samples needed in Theorem 2.2 (and Corollary 2.4) is linear in B and hence, in view of Lemma 3.3, the optimal value of B is a constant. Our understanding of non-adaptive schedules is now complete up to a constant factor. In particular, the schedule (11) and Lemma 3.3 imply that the optimal non-adaptive schedule has length ((ln A) ln n). We would like to have a similar understanding of adaptive cooling schedules. A reasonable conjecture is that the optimal adaptive schedule has length (ln A) ln n . 8 (23) We will present an adaptive schedule of length O ln A(ln n) ln ln A . This comes reasonably close to our guess in (23) (in fact, in our applications we are only off by polylogarithmic factors). We will have the following technical assumptions on A and n. ln n 1, ln ln A 1, and A ln n. (24) The first two assumptions are necessary since both ln n and ln ln A figure in our bounds on the length of the schedule. The third assumption is justified for the following two reasons. First, in the applications we consider, A is usually exponential in n. Second, if A is too small then no cooling schedule is necessary - a direct application of the Monte Carlo method uses only A/2 samples (which, for A ln n, is less than the number of samples needed by a cooling schedule of length given by (23)). 4 Adaptive cooling schedules In this section, we prove the existence of short adaptive cooling schedules for general partition functions. We now formally state the result (to simplify the exposition we will choose B = e2 , the construction works for any B). Theorem 4.1. Let Z be a partition function of degree n. Let A = Z(0). Assume that Z() 1. There exists an e2 -Chebyshev cooling schedule S for Z whose length is at most 4(ln ln A) (ln A) ln n. It will be convenient to define f () = ln Z(). Some useful properties of f are summarized in the next lemma. We include a short proof in Section 6. Lemma 4.2. Let f () = ln Z() where Z is a partition function of degree n. Then (a) f is decreasing, (b) f is increasing (i. e., f is convex) (c) f (0) -n. Recall that an e2 -Chebyshev cooling schedule for Z is a sequence of inverse temperatures 0 , 1 , . . . , such that 0 = 0, = , and Z(2i+1 - i )Z(i ) e2 . Z(i+1 )2 (25) Since (25) is invariant under scaling we can, without loss of generality, assume Z() = 1 (or equivalently a0 = 1). Since we assumed a0 1 the scaling will not increase Z(0). Let f () = ln Z(), so that f (0) = ln A, and f () = 0. The condition (25) is equivalent to f (2i+1 - i ) + f (i ) - f (i+1 ) 1. (26) 2 If we substitute x = i and y = 2i+1 - i , the condition can be rewritten as f x+y 2 f (x) + f (y) - 1. 2 In words, f satisfies approximate concavity. The main idea of the proof is that we do not require this property to hold everywhere but only in a sparse subset of points which 9 will correspond to the cooling schedule. A similar viewpoint is that we will show that f can be approximated by a piecewise linear function g with few pieces, see Figure 1 for an illustration. We form the segments of g in the following inductive, greedy manner. Let i denote the endpoint of the last segment. We then set i+1 as the maximum value such that the midpoint mi of the segment (i , i+1 ) satisfies (26) (for i = i , i+1 = mi ). We now formally state the lemma on the approximation of f by a piecewise linear function. Lemma 4.3. Let f : [0, ] R be a decreasing, convex function. There exists a sequence 0 = 0 < 1 < · · · < j = such that for all i {0, . . . , j - 1}, f and j 1+ (f (0) - f ()) ln f (0) . f () i + i+1 2 f (i ) + f (i+1 ) - 1, 2 (27) Proof : Let 0 := 0. Suppose that we already constructed the sequence up to i . Let i+1 be the largest number from the interval [i , ] such that (27) is satisfied. Let mi = (i + i+1 )/2, let i = (i+1 - i )/2, and Ki = f (i ) - f (i+1 ). If i+1 = then we are done constructing the sequence. Otherwise, by the maximality of i+1 , we have f (i ) + f (i+1 ) - 1. (28) f (mi ) = 2 Using the convexity of f and (28) we obtain - f (i ) and - f (i+1 ) Combining (29) and (30) we obtain -f (i+1 ) Ki - 2 4 f (i+1 ) = =1- . ( ) ( ) f i -f i Ki + 2 Ki + 2 (31) f (i ) - f (mi ) Ki + 2 = , 2 Ki - 2 f (f (mi ) - i+1 ) = . 2 (29) (30) From (30) and the fact that f is decreasing we obtain Ki 2. Hence we can estimate (31) as follows 4 1 f (i+1 ) 1- 1- e-1/Ki . (32) ( ) f i Ki + 2 Ki Since f is decreasing, we have
j-2 i=0 Ki f (0) - f (). (33) Now we combine (32) for all i {0, . . . , j - 2}.
j-2 i=0 f (0) 1 . ln Ki f () 10 (34) 14 12 10 8 f(x) 6 4 2 0 1 2 inverse temperature 3 4 Figure 1: The light curve is f (x) = ln Z(x) for the partition function Z(x) = (1 + exp(-x))20 . The dark curve is a piecewise linear function g consisting of 3 pieces which approximates f . In particular, g f and the midpoint of each piece is close to the average of the endpoints (specifically, (25) holds). Applying Cauchy-Schwarz inequality on (33) and (34) we obtain (j - 1)2 (f (0) - f ()) ln f (0) . f () The construction immediately yields a natural cooling schedule. A schedule ending at k = i , can now be extended by k+1 = mi where mi is the midpoint of the segment (i , i+1 ). Moreover, we can then set k+2 as the midpoint of (mi , i+1 ). We continue in this geometric manner for at most ln ln A steps, after which we can set the next inverse temperature in our schedule to i+1 . Then we continue on the next segment. It then follows that the length of the cooling schedule satisfies j ln ln A where j is the length of the sequence from Lemma 4.3. We now present the proof of the Theorem 4.1. Proof of Theorem 4.1: Let be such that f () = 1. We describe a sequence 0 = 0 < 1 < . . . = satisfying (26). Note that since f () = 1, we can take +1 = and the sequence will still satisfy (26) (and thus we get a complete e2 -Chebyshev cooling schedule for Z). We have
n Z() = exp(f ()) =
i=0 ai e-i = e, and, hence, (using a0 = 1)
n -Z () = i=0 iai e-i e - 1. 11 Thus - f () = - ln Z() = -Z () = Z() n -i i=0 iai e n -i i=0 ai e e-1 . e (35) By Lemma 4.3, there exists a sequence of 0 = 0 < 1 < · · · < j = of length j 1+ (ln A) ln ne e-1 (36) is an e2 -Chebyshev cooling schedule. Let g(x) = f such that (27) is satisfied. Now we show how to add ln ln A inverse temperatures between each pair i and i+1 to obtain our cooling schedule. For notational convenience we show this only for 0 = 0 and 1 . Note that (27) implies that (26) is satisfied for 0 = 0 and 1 = 1 /2. We now show that 0, (1/2)1 , (3/4)1 , (7/8)1 , . . . , (1 - 2-ln ln A )1 , 1 1 + x 2 f (x) + f (1 ) . 2 - Note that by (28) we have g(0) = -1. We have g (x) = Thus if x 1 we have g (x) 0, and, hence, Plugging in x = (1 - 2-t )1 we conclude f ((1 - 2-t-1 )1 ) g(x) g(0) = -1. f ((1 - 2-t )1 ) + f (1 ) - 1. 2 (37) 1 2 f 1 + x 2 - f (x) . (38) From (28) and (38) it follows that the sequence 0, (1/2)1 , (3/4)1 , (7/8)1 , . . . , (1 - 2t )1 , 1 (39) satisfies (26). We will now show that we can truncate the sequence at t = ln ln A and take the next step to 1 . By the convexity of f f ((1 - 2-t-1 )1 ) and hence f ((1 - 2-t-1 )1 ) - f (1 ) 12 f ((2 - 2-t )1 ) + f (1 ) , 2 f ((1 - 2-t )1 ) - f (1 ) . 2 (40) The equation (40) states that the distance of f ((1 - 2-t )1 ) from f (1 ) halves in each step. Recall that f (1 /2) - f (1 ) f (0) ln A and, hence, for t := ln ln A we have f ((1 - 2-t )1 ) - f (1 ) 1. (41) This completes the construction of the cooling schedule. The length of the schedule is jt. Plugging in (36) yields the theorem. The optimal Chebyshev cooling schedule can be obtained in a greedy manner. In particular, starting with 0 = 0, and then from i , choosing the maximum i+1 for which (26) is satisfied. The reason why the greedy strategy works is that if we can step from to , then for any [, ] we can step from to (i. e., having large inverse temperature can not hurt us). The last fact follows from the convexity of f (or alternatively from (37)). Corollary 4.4. Let Z be a partition function of degree n. Let A = Z(0). Assume that Z() 1. Suppose that 0 < · · · < is a cooling schedule for Z. Then the number of indices i for which Z(2i+1 - i )Z(i ) e2 (42) Z(i+1 )2 is at most 4(ln ln A) (ln A) ln n. We now formally prove the greedy property of Chebyshev cooling schedules. Note that we can make a step from x to y if g(x, y) 1, where g(x, y) = f (x) + f (2y - x) - f (y) . 2 (43) Lemma 4.5. Let Z be a partition function. Let f = ln Z() and let g be given by (43). The function g(x, y) is decreasing in x for x < y. The function g(x, y) is increasing in y for x < y. Proof : By Lemma 4.2 we have that f is an increasing function. We have 2y - x > x and hence f (x) - f (2y - x) g(x, y) = < 0. x 2 Analogously g(x, y) = f (2y - x) - f (y) > 0. y Proof of Corollary 4.4: Let k0 < k1 < · · · < km be the indices for which (42) is satisfied. Let 0 = 0 < 1 < · · · < = be the optimal e2 -Chebyshev cooling schedule. We are going to show, using induction on j, that j kj . (44) Clearly (44) is true for j = 0. 13 Now assume (44) is true for some j. We have g(j , j+1 ) 1, g(i , i+1 ) 1, and j i . From Lemma 4.5 it follows that j+1 i+1 and hence j+1 i+1 kj+1 , completing the induction step. Equation (44) implies m 4(ln ln A) (ln A) ln n. 4.1 Extensions The key property of Z() used in the proof of existence of a fast cooling schedule is the fact that it is logconvex (i. e., its logarithm, f () = ln Z(), is convex). The proof above can be appropriately modified for other function classes with this property. We highlight this for a class of continuous functions. Lemma 4.6. Let g : Rn R be a continuous, integrable, nonnegative function. Define Z() =
Rn g(x) dx for > 0. Then Z() is logconvex. The proof is identical to that of Lemma 4.2, part(b). 4.2 Lower bound for adaptive cooling Lemma 4.7. Let n 1. Consider the following partition function of degree n: Z() = (1 + e- )n . Any B-Chebyshev cooling schedule for Z() has length at least n/(20 ln B). Proof : Let f () = ln Z() = n ln(1 + e- ). If the current inverse temperature is i =: , the next inverse temperature i+1 =: + x has to satisfy f () + f ( + 2x) - 2f ( + x) ln B. Later we will show that for any [0, 1] and x [0, 1] we have f () + f ( + 2x) - 2f ( + x) n 2 x . 20 (45) From (45) it follows that for 1 the inverse temperature increases by at most x 20 ln B , n n/(20 ln B). and, hence, the length of the schedule is at least It remains to show (45). Let g(x, ) := f () + f ( + 2x) - 2f ( + x) . 2n 14 We have e--x e--2x g(x, ) = - . x 1 + e--x 1 + e--2x e--x e--2x - x/20, (46) 1 + e--x 1 + e--2x which will imply (45) (by integration over x). Let C := e- and y := 1 - e-x . Note that C [1/e, 1], y [0, 1 - 1/e], and x = - ln(1 - y). For y [0, 1 - 1/e] we have - ln(1 - y) y + y 2 and hence it is enough to show C(1 - y)2 1 C(1 - y) - (47) (y + y 2 ). 2 1 + C(1 - y) 1 + C(1 - y) 20 Multiplying both sides by the numerators we obtain that (47) is equivalent to P (y, C) := y(y + 1)(y - 1)3 C 2 - (y 4 - 2y 3 + 19y 2 - 18y)C - (y 2 + y) 0. We will show The polynomial y(y + 1)(y - 1)3 is negative for our range of y and hence for any fixed y, the minimum of P (y, C) over C [1/3, 1] occurs either at C = 1 or at C = 1/3 (we only need to show positivity of P (y, C) for C [1/e, 1], but for numerical convenience we show it for a larger interval). We have p(y, 1) = y 5 - 3y 4 + 2y 3 - 18y 2 + 16y, and Both (48) and (49) are non-negative for our range of y (as is readily seen by the method of Sturm sequences). This finishes the proof of (46), which in turn implies (45). 9p(y, 1/3) = y 5 - 5y 4 + 6y 3 - 64y 2 + 44y. (49) (48) 5 An adaptive cooling algorithm The main theorem of the previous section proves the existence of a short adaptive cooling schedule, whereas in Section 3 we proved any non-adaptive cooling schedule is much longer. In this section, we present an adaptive algorithm to find a short cooling schedule. We state the main result before describing the details of the algorithm. The algorithm has access to a sampling oracle, which on input produces a random sample from the distribution µ , defined by (1) (or a distribution sufficiently close to µ ). Theorem 5.1. Let Z be a partition function. Assume that we have access to an (approximate) sampling oracle from µ for any inverse temperature . Let > 0. With probability at least 1 - , algorithm Print-Cooling-Schedule outputs a B-Chebyshev cooling schedule for Z (with B = 3 · 106 ), where the length of the schedule is at most (50) 38 ln A(ln n) ln ln A. The algorithm uses at most 1 (51) samples from the µ -oracles. The samples output by the oracles have to be from a distribution µ which is within variation distance /(2Q) from µ . Q 107 (ln A) (ln n) + ln ln A
5 ln 15 In Section 7 we extend the algorithm to the setting of warm-start sampling oracles (see Theorem (7.5)). 5.1 High-level Algorithm Description We begin by presenting the high-level idea of our algorithm. Ideally we would like to find a sequence 0 = 0 < 1 < · · · < = such that, for some constants 1 < c1 < c2 , for all i, the random variable W := Wi ,i+1 satisfies c1 E W2 E (W )2 c2 . (52) The upper bound in (52) is necessary so that Chebyshev's inequality guarantees that few samples of W are required to obtain a close estimate of the ratio Z(i )/Z(i+1 ). On the other side, the lower bound would imply that the length of the cooling schedule is close to optimal. We will guarantee the upper bound for every pair of inverse temperatures, but we will only obtain the lower bound for a sizable fraction of the pairs. Then, using Corollary 4.4, we will argue that the schedule is short. During the course of the algorithm we will try to find the next inverse temperature i+1 so that (52) is satisfied. For this we will need to estimate u = u(i , i+1 ) := E W 2 /E (W )2 . We already have an expression for u, given by equation (5): u= E W2 E (W )
2 = Z(2i+1 - i ) Z(i ) Z(2i+1 - i )Z(i ) . = 2 Z(i+1 ) Z(i+1 ) Z(i+1 ) (53) Hence, to estimate u it suffices to estimate the ratios Z(2i+1 -i )/Z(i ) and Z(i )/Z(i+1 ). Recall that the goal of estimating u was to show that W is an efficient estimator of Z(i )/Z(i+1 ). Now it seems that to estimate u we already need a good estimator for W . An important component of our algorithm, which allows us to escape from this circular loop, is a rough estimator for u which bypasses W . Recall, the Hamiltonian H takes values in {0, 1, . . . , n}. For the purposes of estimating u it will suffice to know the Hamiltonian within some relative accuracy. Thus, we partition {0, 1, . . . , n} into (discrete) intervals of roughly equivalent values of the Hamiltonian. Since we need relative accuracy the size of the interval is smaller smaller values of the for Hamiltonian (specifically, value i is an interval of size about i/ ln A). We let P denote the set of intervals. We will define the intervals so that the number of intervals |P | is at most O( ln A ln n). The rough estimator for u needs an interval I = [b, c] {1, . . . , n} which contributes a significant portion to Z() for all [i , 2i+1 - i ]. We say such an I is heavy for that interval of inverse temperatures. Thus, if we generate a random sample from µ we have a significant probability that the sample is in the interval I. The key observation is that if an interval I is heavy for inverse temperatures 1 and 2 , then by generating samples from µ1 and µ2 , and looking at the proportion of samples whose Hamiltonian falls into interval I, we can roughly estimate Z(2 )/Z(1 ). Thus, if an interval I is heavy for an interval of inverse temperatures B = [i , ], then we can find a i+1 B = [i , (i + )/2] satisfying (52) (making an optimal move in some sense) or determine there is no such i+1 B . 16 In the later case we construct a sequence of inverse temperatures that goes from i to where the upper bound in (52) holds for this sequence. We will show that O(ln ln A) intermediate inverse temperatures are sufficient to go from i to (the construction is analogous to the sequence (39) in the proof of Theorem 4.1). Once we reach we will be done with this interval I and will not need to consider it again. An important fact is that for an interval I, the set of 's where I is heavy is itself an interval. Hence, each interval causes a non-optimal step at most once (causing a sequence of O(ln ln A) intermediate inverse temperatures). Thus, our algorithm will find a cooling schedule whose length is at most (54) O (ln ln A) (ln A) ln n + ln A(ln n) ln ln A , where the first term comes from Theorem 4.1 and the second term comes from the upper bound on |P | = O( ln A(ln n) and the fact that the non-optimal steps cause the algorithm to output a sequence of O(ln ln A) intermediate inverse temperatures. To simplify the high-level exposition of the algorithm we glossed over a technical aspect of the rough estimator which sometimes does not allow a move long enough to finish off the interval I. Such a move will be long relative to the reciprocal of the width of the I and will be referred to as "long" step. ("Long" steps will be analyzed by a separate argument, and their number will be smaller than (54).) Thus, in the detailed description of the algorithm we will have three kinds of steps: "optimal" steps, "interval" steps, and "long" steps. Combining Theorem 5.1 with Corollary 2.4 we obtain. Corollary 5.2. Let Z be a partition function. Let > 0 be the desired precision. Suppose that we are given access to oracles which sample from the distribution within variation distance 2 108 (ln A) (ln n) + ln ln A
5 from µ for any inverse temperature . 10 5 Using 102 (ln A) (ln n) + ln ln A samples in total, we can obtain a random variable S such that P (1 - )Z() S (1 + )Z() 3/4. 5.2 Detailed Algorithm Description Here we present a detailed description of the algorithm. We also present pseudocode for the algorithm in Section 10 of the Appendix. First, we construct a partition P of {0, . . . , n} into O( ln A ln n) disjoint intervals. We construct P inductively, starting with interval [0, 0]. Suppose that {0, . . . , b - 1} is already partitioned. Let w := b/ ln A. (55) Add the interval [b, b + w] to P and continue inductively on {b + w + 1, . . . , n}. Note, the initial ln A intervals are of size 1 (i. e., contain one natural number), and have width 0. Later (in Section 5.3) we will show the following explicit upper bound on the number of intervals in P . 17 Lemma 5.3. |P | 4 ln A ln n. In each stage of the algorithm we want an interval which is heavy in the following precise sense. Definition 5.4. Let Z be a partition function. Let R+ be an inverse temperature. Let I = [b, c] {0, . . . , n} be an interval. For h (0, 1), we say that I is h-heavy for , if for X chosen from µ , we have Pr (H(X) I) h. The following property will be crucial for our algorithm: the set of inverse temperatures for which an interval I is heavy is itself an interval (in R+ ), the proof is deferred to Section 6. Lemma 5.5. Let Z be a partition function. Let I = [b, c] {0, . . . , n} be an interval. Let h (0, 1]. The set of inverse temperatures for which I is h-heavy forms an interval (possibly empty). Let h := 1 . 8|P | (56) In our algorithm we will use an interval which is h-heavy. Given access to a sampler for X µ one can approximately check whether an interval is h-heavy for . More precisely, we can distinguish the case when I is h-heavy versus when I is not 4h-heavy. We formalize this observation in Lemma 5.7. First we need the following definition. Definition 5.6. Let Z be a partition function. Let I = [b, c] {0, . . . , n} be an interval. Let (0, 1] and let be an inverse temperature. Let X µ and let Y be the indicator function for the event H(X) I. Let s = (8/h) ln 1 . Let U be the average of s independent samples from Y . Let Is-Heavy(I, ) = true if U 2h false if U < 2h Lemma 5.7. If I is not h-heavy at inverse temperature , then Pr (Is-Heavy(I, ) = true) . If I is 4h-heavy at inverse temperature , then Pr (Is-Heavy(I, ) = false) . (58) (57) The above lemma is proved in Section 6. If we take s = (8/h) ln 1 samples from µ , and take the interval which received the most samples, then we are likely to get an h-heavy interval. Note that by our choice of h there exists a 8h-heavy interval J. By Lemma 5.7, it is very likely that J receives more than 2hs samples and that all intervals which are not h-heavy receive less than 2hs samples. Thus, the interval with the most samples will likely be h-heavy. 18 Corollary 5.8. Given an inverse temperature , using s = (8/h) ln 1 samples from µ we can find an h-heavy interval. The failure probability of the procedure is at most |P |. We will need a more general version of Corollary 5.8 in which the set of intervals that can be chosen is restricted. The forbidden intervals will not be 8h-heavy and, hence, there will exist an allowed interval which is 8h-heavy. Using the same reasoning as we used for Corollary 5.8 we obtain the following procedure, which we call Find-Heavy. Corollary 5.9. Let be an inverse temperature. Let Bad be a set of intervals such than no interval in Bad is 8h-heavy at . Given an inverse temperature , using s = (8/h) ln 1 samples from µ we can find an h-heavy interval which is not in Bad. The failure probability of the procedure Find-Heavy is at most |P |. We use the following idea: if a narrow interval is heavy for two nearby inverse temperatures 1 , 2 then the interval can be used to estimate the ratio of Z(1 ) and Z(2 ). Lemma 5.10. Let Z be a partition function. Let I = [b, c] {0, . . . , n} be an interval. Let (0, 1]. Suppose that I is h-heavy for inverse temperatures 1 , 2 R+ . Assume that |1 - 2 | · (c - b) 1. (59) For k = 1, 2 we define the following. Let Xk µk and let Yk be the indicator function for the event H(Xk ) I. Let s = (8/h) ln 1 . Let Uk be the average of s independent samples from Yk . Let U1 Est(I, 1 , 2 ) := exp(b(1 - 2 )). (60) U2 With probability at least 1 - 4 we have 4eZ(2 ) Z(2 ) Est(I, 1 , 2 )) . 4eZ(1 ) Z(1 ) The above lemma is proved in Section 6. Remark 5.11. (on imperfect sampling) In the description of our algorithms we will assume that we can perfectly sample from the distributions µ . Of course, in applications we can only sample from distributions which are at a small variation distance from µ . Our algorithms will still work, as the following, standard, coupling trick shows. We can couple the biased distributions and the perfect distributions so that they differ with probability . If we take t samples total then, by union bound, with probability at least 1 - t the algorithm with biased samplers will have the same output as the algorithm with perfect samplers. Remark 5.12. (on randomization) The randomness in our algorithm will come from the procedures Est and Is-Heavy. The failure probability parameter will be chosen very small so that during the execution of the algorithm no failures of Est and Is-Heavy occur with high probability (formally, we use the union bound). Thus in the proof of correctness we will ignore the possibility of failure of these procedures and deal with the errors separately. (61) 19 Remark 5.13. (on binary search) In our algorithm we will have to (approximately) find the right-most point in an interval [a, b] which satisfies a given predicate . The predicate will be such that (a) is true. We use the binary search on an interval [a, b] in the following manner. If (b) is true then we return b. Otherwise we set = a, = b and perform binary search until - , where is the precision. Note that in the end we will have () is false and () is true. We return . We now give a detailed description of our algorithm for constructing the cooling schedule. Let be the desired final error probability of our algorithm. We will call the procedures Est, Is-Heavy, and Find-Heavy with the same value of , which will be chosen as follows: . (62) = 1600(ln n)2 (ln A)2 Let 1 s = (8/h) ln . We will keep a set Bad of banned intervals which is initially empty. Note it suffices to have the penultimate in the sequence be i-1 = ln A, since we can then set i = (see equation (10)). The algorithm for constructing the sequence works inductively. Thus, consider some starting 0 . 1. We first find an interval I that is h-heavy at 0 and is not banned. By generating s samples from the distribution µ0 and taking the most frequently seen interval, we will successfully find an h-heavy interval with high probability (see Corollary 5.9 for the formal statement). 2. Let w denote the width of I, i. e., w = c - b where I = [b, c]. Our rough estimator (given by Lemma 5.10) only applies for 1 0 + 1/w (by convention 1/0 = ). Moreover, since we only need to reach a final inverse temperature of ln A, let L = min{0 + 1/w, ln A}. Now we concentrate on constructing a cooling schedule within (0 , L]. 3. Intuitively, we do binary search in the interval [0 , L] to find the maximum such that is h-heavy. We can use binary search because, by Lemma 5.5, the set of inverse temperatures for which an interval is heavy is an interval in R+ . (More precisely, we do binary search in the interval [0 , L] with predicate Is-Heavy(I, ) and precision = 1/(2n). We use the binary search procedure described in Remark 5.13.) 4. We now check if there is an "optimal" move within the interval B = (0 , (0 + )/2]. We want to find the maximum B satisfying (52) for u(0 , ), or determine no such exists. Let c1 = e2 and c2 = 3 · 106 for (52). To find such a , we do binary search and apply Lemma 5.10 to estimate the ratios Z(2 - 0 )/Z() and Z(0 )/Z(). Note for B we have 2 - 0 [0 , ], hence, the interval I is h-heavy at inverse temperatures 0 , and 2 - 0 and Lemma 5.10 applies.1
1 More precisely, we perform binary search with predicate Est(I, 0 , )·Est(I, 2 - 0 , ) 2000. 20 (a) If such a B exists, then we set as the next inverse temperature and we repeat the algorithm starting from . We refer to these steps as "optimal" moves. (b) If no such exists, then we can reach the end of the interval B as follows. There are two cases, either the interval was too wide for the application of Lemma 5.10, or the interval I stops being heavy too soon. More precisely, either: i. If = L, then we set (0 + )/2 as the next inverse temperature. Moreover, if < ln A we continue the algorithm starting from ; whereas if = ln A we are done. We refer to these steps as "long" moves. ii. Otherwise, we add the following inverse temperatures to our schedule: 3 7 1 0 + , 0 + , 0 + , . . . , 0 + (1 - 2-t ), 0 + , 2 4 8 where = - 0 and t = ln ln A. We add the interval I to the set of banned intervals Bad and continue the algorithm starting from . We refer to these steps as "interval" moves since the interval I will not be used by the algorithm again. Lemma 5.14 (Step 3). Assume that no failures occurred. After step 3 of the algorithm, the interval I is h-heavy for . Moreover, if = L, then the interval I is not 8h-heavy for . Lemma 5.15 (Step 4). Assume that no failures occurred. Then after step 4 of the algorithm Z(0 )Z(2 - 0 ) 3 · 106 . (63) Z()2 Moreover, if < ( + 0 )/2, then Z(0 )Z(2 - 0 ) e2 . Z()2 (64) 5.3 Bounding the length of the cooling schedule We first estimate |P |, the number of intervals in P . It is used to bound the number of interval moves. Proof of Lemma 5.3: Let i {0, . . . , n}. Suppose that the interval I containing i starts at b. Thus, by (55), the width of I is b/ ln A. Since i is in I we have 1 + ln A . (65) i b + b/ ln A b(1 + 1/ ln A) = b ln A We can lower bound the width of the interval containing i as follows (in the second inequality we use (65)): b ln A b ln A -1 21 i - 1. 1+ ln A If for each i {0, . . . , n} we take the width w of the interval containing i and add up the 1/(w + 1), we obtain the number of intervals. Thus, the total number of intervals is bounded as follows n 1 + ln A 1 + (1 + ln n)(1 + ln A) 4 ln A ln n. |P | 1 + (66) i
i=1 We now bound the number of long moves. Lemma 5.16. Assume that failures occurred during the algorithm. The number of no "long" steps is bounded by 26 ln A ln n. Proof : At most one "long" move can have L = ln A (because the algorithm stops at the inverse temperature ln A). Thus, we only need to estimate the number of "long" moves for which L = 0 + 1/w. Let xk be the total number of "long" moves for which the width of the interval I was k. Let k be the largest k such that xk is non-zero. Let yk = xk for k < k and let yk = xk - 1. Let k {1, . . . , k }. Let t = (yk + yk -1 + · · · + yk ). After t "long" moves the inverse temperature satisfies
k 0 i=k yi 2i (67) (0 would be equal to the right-hand side of (67) if we took the t shortest "long" moves). Note that xk + · · ·+ xk > t, and, hence, we still have to make a long step with the width of the h-heavy interval I at least k. This long step has to happen at an inverse temperature 0 , or higher. We will need the following property - for any interval [b, c] P of width w = c - b and any i [b, c] we have (68) i b w ln A. This follows directly from (55) (since the chose the width w to be b/ ln A). From (68) we have ai e-0 i Ae-0 k ln A . (69)
iI Assume the left-hand side of (69) is h. Then I is not h-heavy for any 0 , since for X µ ai e-0 i ai e-i iI h, Pr (H(X) I) = iI Z() Z() in the last inequality we used Z() a0 1, which is true for any . Thus, in the binary search in step 3) of the algorithm the Is-Heavy will always report false and will be about 0 + 1/2n (more precisely 0 + 1/2n). Hence < L, which implies that a "long" move with I of width k is impossible, a contradiction. 22 Thus, the left-hand side of (69) is h, and hence Ae-k By combining (70) and (67) we obtain
k i=k ln A h. (70) yi 2 ln(A/h) . 20 · i k ln A (71) Adding (71) for k = 1, . . . , k we obtain ln(A/h) ln(1/h) ln A + yi 2(1 + ln n) 4(ln n) ln A ln A i=1 By Lemma 5.3 and the definition of h (equation (56)) we have 1/h 32 ln A ln n. and hence (using our assumptions (24)) we obtain ln(1/h) 5 ln A. The total number of long moves is thus bounded
k k . (72) 2+
i=1 yi 2 + 24(ln n) ln A 26 ln A ln n. We now prove Theorem 5.1. Proof of Theorem 5.1: The number of "optimal" moves is bounded by 4 (ln A) ln n ln ln A, (73) see Corollary 4.4. Each "interval" move causes at most s = 2 ln ln A inverse temperatures to be output. Hence, by Lemma 5.3, the total number of inverse temperatures output by "interval moves" is bounded by 8 ln A(ln n) ln ln A. (74) Finally, the number of "long" moves is bounded by Lemma 5.16, and it is at most 26 ln A ln n. (75) The total number of moves is bounded by the sum of (73),(74),(75), which is bounded by 38 ln A(ln n) ln ln A. This proves (50). Let T = 38 ln A(ln n) ln ln A. The length of the output schedule is bounded by T and hence every step of the algorithm is executed at most T times. The binary search on step 23 3 starts with an interval of width at most ln A and works with precision 1/(4n). The total number of calls to Est is thus bounded by 2T log2 (8n ln A) 8T (ln n + ln ln A). (76) The total number of calls to Is-Heavy is certainly bounded by (76), since the starting interval has width at most ln A and works with precision only 1/(2n). Finally, the number of calls to Find-Heavy is at most T . A